Unique estimation in EEG analysis by the ordering ICA

Independent Component Analysis (ICA) is a method for solving blind source separation problems. Because ICA needs only weak assumptions to estimate unknown sources from the observed signals alone, it is suitable for electroencephalography (EEG) analysis. A serious disadvantage of traditional ICA algorithms is that their results often fluctuate from run to run and do not converge to a unique, globally optimal solution, because of many local optima and permutation ambiguities. We have recently proposed a new ICA algorithm named the ordering ICA, a simple extension of Fast ICA. The ordering ICA is theoretically guaranteed to extract the independent components in a unique order and avoids the local optima in practice. This paper investigates the usefulness of the ordering ICA in EEG analysis. Experiments showed that the ordering ICA can give unique solutions for the components with large non-Gaussianity and that its ease of parallelization keeps the computation time modest.


Introduction
Electroencephalography (EEG) is a method of monitoring the electrical activity of the brain [1]. As EEG is a non-invasive and low-cost method that does not impose a heavy burden on subjects, it is widely used for measuring brain activity. On the other hand, EEG has only a limited number of noisy channels, and its observed results fluctuate across subjects and trials. Therefore, EEG needs advanced and robust techniques for extracting the essential components from such noisy and limited signals. Independent component analysis (ICA) is a widely used technique for analyzing EEG signals [2]. One of the most significant advantages of ICA is its versatility: ICA needs only a simple assumption on the model generating the observed signals (a linear mixture of non-Gaussian sources). Therefore, ICA is suitable for EEG analysis, where an unknown mixture model changes according to the subjects and trials. An ICA-based analysis tool named EEGLAB has been widely used [3]. One disadvantage of ICA is that, unlike principal component analysis (PCA), it often generates a different solution at each run for the same dataset, which reduces the robustness of the analysis. This indeterminateness is caused by the non-linearity of the objective functions of ICA, which gives ICA many local minima. As non-linearity is essential to ICA, it is difficult for previous ICA methods to solve this problem.
Recently, we proposed a new ICA method named the ordering ICA [4], which is theoretically guaranteed to find the unique solution of ICA. Though it was much slower than the previous ICA methods, the original paper suggested that parallel processing could accelerate it. This paper constructs a parallel implementation of the ordering ICA on multi-core CPUs and applies it to EEG analysis. The experimental results show that our implementation can extract the essential components more robustly than the previous methods within a modest computation time.
This paper is organized as follows. Section 2 briefly explains ICA and the ordering ICA. Section 3 details the parallel implementation of the ordering ICA. Section 4 shows the experimental results of our implementation in EEG analysis. Lastly, Section 5 concludes this paper.

Independent component analysis and EEGLAB
Generally, ICA is used for extracting unknown sources from observed signals in an unsupervised manner [5,6]. ICA assumes the linear mixture model X = AS, where $X = (x_{im})$ and $S = (s_{im})$ (N × M matrices) are the observed signals and the unknown sources, respectively, and $A = (a_{ij})$ (an N × N matrix) is the unknown mixing matrix. N and M are the dimension of the signals and the sample size, respectively. ICA estimates the unknown mixing matrix A so that the sources in S are as statistically independent as possible. ICA is solved as an optimization problem whose objective function is based on higher-order statistics such as kurtosis; the objective function is derived from the assumption that each source is generated independently according to a non-Gaussian distribution. It is well known that ICA is quite useful for analyzing EEG signals from brain activity, and EEGLAB is the most widely used ICA-based tool implemented in MATLAB [3].
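As a small illustration of this generative model (a hypothetical toy example, not an EEG dataset or part of EEGLAB), the following MATLAB sketch draws two non-Gaussian sources and mixes them with a random matrix:

```matlab
% Toy illustration of the linear mixture model X = A*S (hypothetical data).
N = 2;  M = 1000;                   % dimension and sample size
S = zeros(N, M);
S(1,:) = rand(1, M);                % sub-Gaussian (uniform) source
S(2,:) = randn(1, M).^3;            % super-Gaussian source
S = S - mean(S, 2);                 % center each source
S = S ./ std(S, 0, 2);              % normalize to unit variance
A = randn(N, N);                    % unknown N x N mixing matrix
X = A * S;                          % observed signals (N channels, M samples)
% ICA estimates a separating matrix W from X alone so that
% Y = W*X recovers S up to permutation and sign.
```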
There are many ICA methods, differing in the choice of the objective function and the optimization algorithm. Three representative ICA methods are employed in EEGLAB: the extended InfoMax [7] (referred to as InfoMax), Fast ICA [8], and JADE [9]; many other methods are variations of these three. The third method, JADE, is not applicable to high-dimensional signals. As the number of EEG channels is generally more than 60, JADE cannot extract the essential components from EEG signals. Therefore, we focus on InfoMax and Fast ICA here. InfoMax employs a stochastic gradient algorithm for optimizing one of two hyperbolic-tangent-based objective functions, depending on the currently estimated source distribution. Though InfoMax can find essential components robustly, its convergence is slow. Fast ICA employs the Newton-Raphson method to optimize kurtosis-based objective functions; its advantage is the speed of its convergence. Its disadvantage is that the later-extracted components tend to fluctuate more. One reason is that the usual approach of Fast ICA (called the deflation approach) estimates the components one by one under the orthonormality constraint, so the errors accumulated over the successive estimation of components cause the fluctuations. Such accumulated errors do not occur in InfoMax because InfoMax estimates all the components simultaneously. Consequently, InfoMax has been preferred to Fast ICA in EEG analysis because of its robustness.

Ordering ICA
The ordering ICA is our recently proposed method [4], a variation of the deflation approach of Fast ICA. In summary, the ordering ICA finds the globally optimal component at each deflation step by repeating the Newton-Raphson method with many different initializations and selecting the best solution from the multiple candidates. Here, we explain the ordering ICA briefly; see [4] for the details.
First, we employed a novel objective function in the ordering ICA. Let $w_i = (w_{i1}, \ldots, w_{iN})$ be the i-th row of the separating matrix W, which is expected to be the inverse of the mixing matrix A, and let $Y = (y_{im}) = WX$ denote the estimated sources. Let

$$\alpha_i = \frac{1}{M} \sum_{m=1}^{M} y_{im}^4 - 3, \qquad (1)$$

which is the estimated kurtosis of the i-th row of Y if the row is normalized. The objective function of the i-th component in the ordering ICA is given as

$$F_i = U(\alpha_i), \qquad (2)$$

where U is a convex function (see [4] for its exact definition). Now, the following theorem holds:

Theorem 1. Assume that S consists of normalized independent sources and does not include any uniform Bernoulli variable, and that X is given by an invertible linear ICA model X = AS in the real domain. Then, all the non-Gaussian sources are extracted in descending order of $U(\kappa_i)$ if each $F_i$ is globally maximized subject to $\sum_m y_{im} = 0$ and $\sum_m y_{im} y_{jm} / M = \delta_{ij}$ for $j \le i$ (the Gram-Schmidt orthonormalization), where $\kappa_i$ is the true kurtosis of the i-th row of S.
In other words, by globally maximizing $F_i$ with respect to $w_i$ one by one for each i under the Gram-Schmidt orthonormalization, the true separating matrix W (and the true mixing matrix A = W⁻¹) can be estimated uniquely if the assumptions are satisfied. In practice, $\kappa_i$ can be estimated by $\alpha_i$.
Next, the algorithm for maximizing $F_i$ under the Gram-Schmidt orthonormalization is described. It is easily proved from the convexity of $U(\alpha_i)$ that the global maximum of $F_i = U(\alpha_i)$ is attained at one of the local maxima and minima of $\alpha_i$. Therefore, we can utilize the deflation approach of the Fast ICA algorithm on the kurtosis [8]. As it finds a local minimum or a local maximum of $\alpha_i$ in each run by the Newton-Raphson method, we can find the global maximum of $F_i$ without fail if the starting points of the runs are sampled sufficiently densely from the entire space of $w_i$. In practice, by using Fast ICA to generate many candidates with different random initializations and selecting the best result with the highest $F_i$, the global maximum is expected to be found in most cases. This algorithm estimates W from a given X and is called "the ordering ICA" (Algorithm 1). Though the original ordering ICA [4] can estimate the number of independent components adaptively, we fix the number of independent components to N in this paper.
Algorithm 1: Ordering ICA (the pseudocode listing is not reproduced here; a simplified sketch is given below).

In the ordering ICA, using many candidates (the number is denoted by L) improves the robustness of the results. Because the total computation time is linear in L, the efficiency advantage of Fast ICA is lost if a naive implementation is employed.
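The following MATLAB sketch illustrates one deflation step of this idea: the kurtosis-based fixed-point update of Fast ICA [8] is run from L random initializations under Gram-Schmidt orthonormalization against the already extracted rows, and the converged candidate with the largest $F_i$ is kept. The function names (ordering_step, gram_schmidt) are hypothetical, the data are assumed to be centered and whitened, and |α| is used here as a simple convex stand-in for U(α), whose exact form is given in [4]:

```matlab
function [w_best, F_best] = ordering_step(Z, Wprev, L, K, tol)
% One deflation step of the ordering ICA (illustrative sketch, not [4]).
% Z     : N x M centered and whitened signals
% Wprev : (i-1) x N already extracted (orthonormal) rows of W
% L     : number of random initializations (candidates)
% K     : maximum fixed-point iterations per candidate
% tol   : convergence threshold
[N, M] = size(Z);
F_best = -Inf;  w_best = zeros(N, 1);
for k = 1:L                                % candidates are independent,
    w = gram_schmidt(randn(N, 1), Wprev);  % hence easy to parallelize
    for t = 1:K
        % Kurtosis-based fixed-point update of Fast ICA [8]
        w_new = gram_schmidt((Z * (Z' * w).^3) / M - 3 * w, Wprev);
        converged = abs(abs(w_new' * w) - 1) < tol;
        w = w_new;
        if converged, break; end
    end
    if ~converged, continue; end           % failed runs yield no candidate
    alpha = mean((w' * Z).^4) - 3;         % estimated kurtosis (Eq 1)
    F = abs(alpha);                        % stand-in for U(alpha) (Eq 2)
    if F > F_best, F_best = F; w_best = w; end
end
end

function w = gram_schmidt(w, Wprev)
% Project out the already extracted directions and normalize.
if ~isempty(Wprev), w = w - Wprev' * (Wprev * w); end
w = w / norm(w);
end
```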

Implementation
Here, we describe the parallel implementation of the ordering ICA. The time-consuming part of the ordering ICA is the multiple Fast ICA runs with different initializations (lines 9-13 of Algorithm 1). Because the runs are independent, we can execute them in parallel. As the widely used tool EEGLAB is implemented in MATLAB, we implemented the ordering ICA using the Parallel Computing Toolbox of MATLAB. We refer to this implementation as the parallel-ordering ICA. See the released code (ParallelOrderingICA.m) at https://github.com/yoshitatsumatsuda/orderingICA/blob/master/ParallelOrderingICA.m for the details.
Notably, this implementation uses a simple convergence condition in the FastICA function. If we ran the FastICA function just once, we would need to re-estimate w with a different initialization whenever the run failed to converge. The parallel-ordering ICA can skip such re-estimation unless all the FastICA runs fail to converge.
The hyper-parameter L (the number of parallel candidates) is set to R times the number of available cores in the given system; in the implemented code, L = feature('numcores') * R with the Parallel Computing Toolbox. R is the overhead rate, whose default value is 1. By using a larger R, we can increase the number of candidates L at the cost of computational overhead. As long as the convergence threshold ε is sufficiently small, its exact value does not have significant effects on the results, so we simply set ε to 10⁻⁶. The maximum number K of iterations in each FastICA function is set to 30. K can be relatively small because convergence failures of some FastICA runs are not a problem.
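Under these settings, the candidate loop can be parallelized as in the following minimal sketch (hypothetical and simplified; the released ParallelOrderingICA.m is the actual implementation). The helper one_candidate is hypothetical and stands for a single fixed-point run (the body of the k-loop in the sketch of Section 2), assumed to return F = -Inf when the run fails to converge within K iterations:

```matlab
% Parallel candidate evaluation (illustrative sketch, not the released
% ParallelOrderingICA.m). Z and Wprev are as in the previous sketch.
R = 1;                               % overhead rate (default)
L = feature('numcores') * R;         % number of parallel candidates
epsilon = 1e-6;                      % convergence threshold
K = 30;                              % max iterations per FastICA run
N = size(Z, 1);
ws = zeros(N, L);  Fs = -Inf(1, L);
parfor k = 1:L                       % independent runs on separate workers
    [w, F] = one_candidate(Z, Wprev, K, epsilon);
    ws(:, k) = w;  Fs(k) = F;
end
[~, kbest] = max(Fs);                % keep the best candidate; a re-run is
w_i = ws(:, kbest);                  % needed only if all candidates fail
```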

Experiments on EEG analysis
To verify the usefulness of the proposed method, we applied it to several public EEG datasets and investigated the results from the viewpoints of robustness, computation time, and the relation between the degree of non-Gaussianity and the success rate of finding a unique solution.

Investigation of the robustness and the computation time
To verify the robustness and efficiency of the parallel-ordering ICA, we investigated the averaged fluctuations of the solutions over different runs and the averaged computation time on 5SUBJECTS and STERN. MATLAB R2022a with the Parallel Computing Toolbox was employed. The implementation of the parallel-ordering ICA (called Ordering ICA) in Section 3 was applied directly to each EEG signal. For comparison, we applied the extended InfoMax of EEGLAB 2021.1 (called InfoMax), available at https://github.com/sccn/eeglab, and the kurtosis-based Fast ICA (FastICA 2.5), available at https://research.ics.aalto.fi/ica/fastica/, to the same signals. We did not apply JADE because N was too large for it. We conducted the experiments on a 32-core server (dual CPUs with 16 cores each) with 256 GB RAM, where each core is an Intel Xeon 2.1 GHz processor.

The fluctuation is measured as follows. Each algorithm (Ordering ICA, InfoMax, and Fast ICA) was applied to each EEG signal from different initializations over T = 10 runs. To remove the permutation ambiguities, the $w_i$'s in all the solutions were ordered by $U(\alpha_i)$, a degree of non-Gaussianity. Note that the number of $w_i$'s is often less than the number of channels N in Fast ICA, because the Fast ICA algorithm often fails to converge if the non-Gaussianity of the signal is low. Now, $w_i$ in the p-th run is denoted by $w_i^p$ (p = 1, . . ., T). To estimate the fluctuations of the results over the different runs, we use the divergence $\delta(i, p, q)$ between $w_i^p$ and $w_i^q$, defined as

$$\delta(i, p, q) = 1 - \frac{\left| {w_i^p}^{\top} w_i^q \right|}{\| w_i^p \| \, \| w_i^q \|}. \qquad (3)$$

Since the positive or negative sign of $w_i$ is arbitrary in ICA, the absolute value of the cosine similarity is employed. $\delta(i, p, q)$ takes the minimum 0 if and only if $w_i^p$ and $w_i^q$ have the same or opposite directions, and it takes the maximum 1 when $w_i^p$ is orthogonal to $w_i^q$. If $w_i^p$ (or $w_i^q$) does not exist due to non-convergence in Fast ICA, $\delta(i, p, q)$ is set to the maximum 1.
Consequently, the fluctuation of $w_i$ over all the runs is evaluated as the average of $\delta(i, p, q)$ over all pairs of runs, defined as

$$\bar{\delta}_i = \frac{2}{T(T-1)} \sum_{p < q} \delta(i, p, q). \qquad (4)$$

First, the experimental results on 5SUBJECTS are shown. The number of candidates L in Ordering ICA was set from 8 to 128 in increments of 8; equivalently, the overhead rate R for the 32-core server was set from 0.25 to 4 in increments of 0.25. Fig 1 shows the results. In summary, the ordering ICA algorithm with L = 64 is always superior to the widely used InfoMax algorithm from the viewpoint of robustness (measured by the averaged fluctuation). In addition, the computation time of the parallel-ordering ICA algorithm even with L = 128 is at most twice as long as that of InfoMax when a 32-core server can be utilized. Though the computation time is roughly linear in L, the linear factor can be reduced by using more cores. These results verify the usefulness of the proposed parallel implementation of the ordering ICA in EEG analysis.
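For reference, the fluctuation measure of Eqs 3 and 4 can be computed as in the following MATLAB sketch (an illustrative helper, not part of the released code), assuming the T solutions for the i-th component are stored as the columns of Wi, with NaN columns marking non-converged Fast ICA runs:

```matlab
function fl = fluctuation(Wi)
% Average divergence delta(i,p,q) over all pairs of runs (Eqs 3-4 sketch).
% Wi : N x T matrix; column p is w_i^p (a NaN column if run p diverged).
T = size(Wi, 2);
d = zeros(T, T);
for p = 1:T
    for q = (p+1):T
        wp = Wi(:, p);  wq = Wi(:, q);
        if any(isnan(wp)) || any(isnan(wq))
            d(p, q) = 1;                  % non-convergence: maximum divergence
        else
            c = abs(wp' * wq) / (norm(wp) * norm(wq));
            d(p, q) = 1 - c;              % 0 iff same or opposite directions
        end
    end
end
fl = sum(d(:)) * 2 / (T * (T - 1));       % average over the T(T-1)/2 pairs
end
```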

Relation between non-Gaussianity and success rate of finding unique solution
As shown in the preceding section, the parallel-ordering ICA algorithm can find a unique solution in many cases for the components with large non-Gaussianity. However, it is not guaranteed to succeed in all cases because it is a randomized algorithm. Here, the success rate is evaluated empirically. We also discuss the failure rate of the parallel-ordering ICA algorithm and the appropriate choice of the number of candidates L. In addition to 5SUBJECTS and STERN, the TUH dataset is employed to verify the versatility of the method for various EEG signals.
The success rate of finding the unique solution of a component was empirically evaluated as follows. Ordering ICA with the largest number of candidates (L = L_max) is carried out for each EEG signal in a single run. Let $w_i^k$ be the k-th candidate of the i-th component. The estimated i-th component $\hat{w}_i = w_i^{\hat{k}}$ (selected from all the candidates $w_i^k$, k = 1, · · ·, L_max) is assumed to be the globally optimal and unique solution. The success rate can then be evaluated as the rate of candidates sufficiently near $\hat{w}_i$. The set of such candidates is defined as

$$\hat{\mathcal{K}}_i = \left\{ k \mid \delta(i, k, \hat{k}) < \hat{\epsilon} \right\},$$

where $\delta(i, k, \hat{k})$ is the divergence between $w_i^k$ and $\hat{w}_i$ defined by Eq 3 and $\hat{\epsilon}$ is a small threshold. Then, the success rate $\hat{\mu}_i$ of the i-th component is evaluated as

$$\hat{\mu}_i = \frac{|\hat{\mathcal{K}}_i|}{L_{\max}},$$

where $|\mathcal{X}|$ denotes the size of a set $\mathcal{X}$. In this paper, L_max and $\hat{\epsilon}$ were set to 1000 and 0.001, respectively. Note that the minimum of $\hat{\mu}_i$ is 1/L_max. If every candidate of a component diverges, the component is neglected in the following analysis.

Fig 7 shows the averaged success rate of each component (sorted by non-Gaussianity) for 5SUBJECTS, STERN, and TUH. Here, the success rates of TUH are higher than those of 5SUBJECTS and STERN, and those of 5SUBJECTS are slightly higher than those of STERN. The difficulty of analyzing each dataset is probably proportional to the number of channels; this suggests that TUH is easier to solve than 5SUBJECTS and STERN. Fig 7 also shows that the success rate decreases as the non-Gaussianity decreases for the upper-ranked components. On the other hand, the success rate tends to increase as the non-Gaussianity decreases for the lower-ranked components, probably because the degree of freedom in the deflation approach is lower for the lower-ranked components. For example, the solution of the last component is necessarily unique (namely, $\hat{\mu}_i = 1$) in a single run.

Fig 8 shows the relation between the non-Gaussianity (measured by $\log_{10} U(\alpha_i)$ of Eq 2) and the success rate (measured by $\log_{10} \hat{\mu}_i$) for the upper-ranked components of the three datasets: 5SUBJECTS (the top 40 components), STERN (the top 40 components), and TUH (the top 15 components). It shows that the relation between $\log_{10} U(\alpha_i)$ and $\log_{10} \hat{\mu}_i$ is approximately linear, so the worst-case success rate can be estimated by the fitted linear regression model. Since a single run with L candidates fails for the i-th component only when all L candidates miss the global optimum, the per-component failure rate is $\nu_i^{\mathrm{each}} = (1 - \hat{\mu}_i)^L$, and the total failure rate $\nu_i^{\mathrm{total}}$ of failing to find a unique solution up to the i-th component is given by the recurrence relation

$$\nu_i^{\mathrm{total}} = \nu_{i-1}^{\mathrm{total}} + \left(1 - \nu_{i-1}^{\mathrm{total}}\right) \nu_i^{\mathrm{each}},$$

where $\nu_0^{\mathrm{total}} = 0$. Therefore, we can approximately evaluate the failure rate of the upper-ranked components from the non-Gaussianity $U(\alpha_i)$.

Fig 9 plots all the components of all the datasets with the non-Gaussianity on the X-axis and the success rate on the Y-axis, together with the histograms of each axis. It shows that there are clear bounds: for $\log_{10} U(\alpha_i) > 1$ (namely, $U(\alpha_i) > 10$), $\hat{\mu}_i > 10^{-1.5}$ holds in almost all cases. Therefore, the failure rate $\nu_i^{\mathrm{each}}$ of the components with $U(\alpha_i) > 10$ is bounded from above by $(1 - 10^{-1.5})^L \simeq 10^{-0.0139 L}$. For example, $\nu_i^{\mathrm{each}} < 0.0164$ for L = 128. If we employ a larger L (for example, L = 256), $\nu_i^{\mathrm{each}}$ can be reduced sufficiently ($\nu_i^{\mathrm{each}} < 3 \times 10^{-4}$), and the total failure rate is also reduced ($\nu_i^{\mathrm{total}} < i \times 3 \times 10^{-4}$) if $U(\alpha_i) > 10$ holds for the top-ranked i components. Because the number of components with $U(\alpha_i) > 10$ is 1096 out of the 5713 components in the three datasets, the top 20 percent of the components with large non-Gaussianity are expected to be estimated uniquely for L = 256. On the other hand, some components have $\hat{\mu}_i < 10^{-1.5}$ when $\log_{10} U(\alpha_i) < 1$.
Finding a unique solution for such components is often difficult; for example, the evaluated $\hat{\mu}_i$ was often the smallest possible value of 1/1000, which shows that Ordering ICA failed to find a unique solution even for L = 1000 in such cases. Because the number of components with $\hat{\mu}_i < 10^{-1.5}$ is 392 in the three datasets, they are expected to be less than 10 percent of all the components. Nevertheless, it should be noted that the total failure rate $\nu_i^{\mathrm{total}}$ can become drastically high if the estimation of a preceding component fails. In summary, we observed an approximately linear regression model between the logarithm of the non-Gaussianity and that of the success rate, by which we can estimate the failure rate of finding a unique solution. In addition, the failure rate was low enough to find a unique solution almost certainly by Ordering ICA with an available number of candidates (for example, L = 256) as long as the non-Gaussianity is relatively large ($U(\alpha_i) > 10$).
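The empirical evaluation above can be sketched in MATLAB as follows (hypothetical variable names; Wc is assumed to hold the L_max candidates of the i-th component as columns and Fvals their objective values U(α)):

```matlab
% Success-rate evaluation for the i-th component (illustrative sketch).
% Wc    : N x Lmax candidate matrix from a single run with L = Lmax.
% Fvals : 1 x Lmax objective values U(alpha) of the candidates.
Lmax = size(Wc, 2);
eps_hat = 0.001;                      % threshold used in this paper
[~, khat] = max(Fvals);               % best candidate = assumed unique solution
w_hat = Wc(:, khat);
d = 1 - abs(w_hat' * Wc) ./ (norm(w_hat) * vecnorm(Wc));   % Eq 3 divergences
mu_hat = sum(d < eps_hat) / Lmax;     % success rate (at least 1/Lmax)
L = 256;                              % e.g., the value discussed above
nu_each = (1 - mu_hat)^L;             % failure rate of a run with L candidates
```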

Conclusion
This paper proposed the parallel-ordering ICA algorithm. The experimental results on standard EEG datasets verified that our implementation is superior to the widely used InfoMax of EEGLAB in both stability and efficiency.

We plan to investigate the usefulness of the parallel-ordering ICA algorithm in practical EEG analysis. In addition, we plan to accelerate the parallel computation by GPUs and distributed systems. In particular, distributed systems may be promising because the parallelization of the ordering ICA is quite simple.